FIGURE 4.11
The main framework of the proposed DCP-NAS, where α and α̂ denote the real-valued and binary architectures, respectively. We first conduct the real-valued NAS in a single round and generate the corresponding tangent direction. Then we learn a discrepant binary architecture via tangent propagation. In this process, the real-valued and binary networks inherit architectures from their counterparts in turn.

where ⊗ denotes the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. Based on this, a standard NAS problem is given as

\max_{w \in \mathcal{W},\, \alpha \in \mathcal{A}} f(w, \alpha),    (4.20)

where f : W × A → R is a differentiable objective function w.r.t. the network weight w ∈ W and the architecture α ∈ A ⊂ R^{M×E}, where E and M denote the number of edges and operators, respectively. Considering that directly optimizing f(w, α) is a black-box problem, we relax the objective function to f̃(w, α) as the objective of NAS

\min_{w \in \mathcal{W},\, \alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{NAS}} = \tilde{f}(w, \alpha) = -\sum_{n=1}^{N} p_n(X) \log\big(p_n(w, \alpha)\big),    (4.21)

where N denotes the number of classes and X is the input data. f̃(w, α) represents the performance of a specific architecture with real-valued weights, where p_n(X) and p_n(w, α) denote the true distribution and the distribution of the network prediction, respectively.
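To make the relaxed objective concrete, below is a minimal sketch of how such a differentiable search objective can be evaluated in PyTorch, in the spirit of DARTS-style weight sharing: each edge mixes its M candidate operations with softmax-weighted architecture parameters α, and f̃(w, α) is the cross-entropy between the labels and the supernet prediction. The names (MixedOp, TinySupernet, nas_loss), the candidate operation set, and all shapes are illustrative assumptions, not the actual DCP-NAS implementation.

# Illustrative sketch of the relaxed NAS objective in Eq. (4.21); assumptions noted above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedOp(nn.Module):
    """One edge of the cell: a softmax-weighted sum of M candidate operations."""

    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, alpha_edge):
        # alpha_edge: (M,) architecture logits for this edge.
        weights = F.softmax(alpha_edge, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class TinySupernet(nn.Module):
    """A toy weight-sharing supernet with E edges, each holding M candidate operations."""

    def __init__(self, channels=16, num_classes=10, num_edges=4):
        super().__init__()

        def candidates():
            return [
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.Identity(),
            ]

        self.stem = nn.Conv2d(3, channels, 3, padding=1, bias=False)
        self.edges = nn.ModuleList(MixedOp(candidates()) for _ in range(num_edges))
        # Architecture parameters alpha in R^{E x M} (one row of logits per edge).
        self.alpha = nn.Parameter(1e-3 * torch.randn(num_edges, len(candidates())))
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):
        h = self.stem(x)
        for edge, alpha_edge in zip(self.edges, self.alpha):
            h = edge(h, alpha_edge)
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)
        return self.classifier(h)


def nas_loss(model, images, labels):
    # f~(w, alpha): cross-entropy between the true distribution p_n(X) (the labels)
    # and the network prediction p_n(w, alpha), as in Eq. (4.21).
    return F.cross_entropy(model(images), labels)

In a bilevel search, w and α would typically be updated on separate data splits in alternating steps; the sketch only shows how the relaxed loss of Eq. (4.21) is evaluated for a given (w, α).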

Binary neural architecture search
The 1-bit model aims to quantize ŵ and â_in into b_ŵ ∈ {−1, +1}^{C_out×C_in×K×K} and b_â_in ∈ {−1, +1}^{C_in×H×W} using the efficient XNOR and bit-count operations to replace full-precision operations. Following [48], the forward process of the 1-bit CNN is

\hat{a}_{out} = \beta \circ (b_{\hat{a}_{in}} \odot b_{\hat{w}}),    (4.22)

where ⊙ denotes the XNOR and bit-count operations and ∘ denotes channel-wise multiplication. β = [β_1, · · · , β_{C_out}] ∈ R_+^{C_out} is the vector consisting of the channel-wise scale factors. b = sign(·) denotes the binarized variable obtained with the sign function, which returns +1 if the input is greater than zero and −1 otherwise.
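As an illustration, the following is a minimal sketch of this binarized forward pass, assuming a PyTorch backend. The XNOR and bit-count kernel is emulated with an ordinary convolution over {−1, +1} tensors, the backward pass uses a common straight-through estimator with clipping, and the names BinarizeSTE and BinaryConv2d are hypothetical rather than part of any official DCP-NAS code.

# Illustrative sketch of the 1-bit forward pass in Eq. (4.22); assumptions noted above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """sign(.) in the forward pass, straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Map x > 0 to +1 and x <= 0 to -1 (torch.sign alone would return 0 at 0).
        return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1, a common STE clipping rule.
        return grad_output * (x.abs() <= 1).float()


class BinaryConv2d(nn.Module):
    def __init__(self, c_in, c_out, kernel_size, padding=0):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(c_out, c_in, kernel_size, kernel_size))
        # Channel-wise scale factors beta in R_+^{C_out}.
        self.beta = nn.Parameter(torch.ones(c_out))
        self.padding = padding

    def forward(self, a_in):
        b_a = BinarizeSTE.apply(a_in)         # b_{a_in} in {-1, +1}^{C_in x H x W}
        b_w = BinarizeSTE.apply(self.weight)  # b_w in {-1, +1}^{C_out x C_in x K x K}
        # Emulate the XNOR and bit-count operations with a real-valued convolution
        # of the binarized tensors.
        out = F.conv2d(b_a, b_w, padding=self.padding)
        # Channel-wise multiplication by beta (broadcast over N, H, W).
        return self.beta.view(1, -1, 1, 1) * out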

The output then passes through several non-linear layers, e.g.,